MATCHING AND RANKING
LEGAL CITATIONS

BACKGROUND OF THE INVENTION 

1. Field of the Invention

The invention relates to retrieval of legal information by data processing systems.

2. Background

The essence of legal research in common law jurisdictions is the retrieval of relevant
legal information. Within the world of the common law, cases are one of the most
important sources of legal information. Unlike legislation, they have no official
indexing or form of informational organization. Legal cases, as produced by the
courts, generally lack defined structural elements such as titles, abstracts and
subsections. Legal publishers and some legal electronic information vendors devote
a great deal of resources towards the indexing, summarization and classification of
cases and other documents. Private sector publishers often place this added
information at the front of a case in the form of what has become known as a
headnote. This permits the user to quickly ascertain whether or not the document
is likely to be relevant. These headnotes are prepared by legally trained people, at
a significant cost, and the publishers retain copyright in them. Without headnotes,
a researcher may have to spend a great deal of time reading through a case in order
to ascertain whether or not it is of interest.

An ideal legal information system ought to receive queries in a very simple form that
allows the user to specify the issues of interest, and ought to intelligently identify
documents addressing the issues specified by the query. The user ought to be able to
look at the first screen of a retrieved document and come to some conclusion as to its
relevance to the search query. Returned documents ought to be ordered by their
degree of relevance to the query. The system should be rich in relevant
hypertext links.

The law is generally found in four different kinds of documents. The law is found
in a non-authoritative form in legal treatises. It is found in authoritative form in the
precedents and written decisions of the courts. It is found in legislative fiats such as
statutes, codes and regulations. And it is found in a non-authoritative form in the
framework of facts and arguments which define actual litigation.

Coinciding with these kinds of sources, one can create a profile of any legal
document in terms of concepts, case citations, statute citations, and facts. Such a
profile of a legal document can be used throughout an information system to form
queries, to summarize the document, and to place hypertext links automatically
between the summaries and the full text of the document.

A case, as a legal document, can be represented in terms of a profile of (i) the
relevant important concepts, (ii) the citations to the leading cases on the topic,
(iii) the relevant legislative provisions, if applicable, and (iv) the significant factual
terms. Thus a case dealing with a felony murder where a convenience store owner
dies of a heart attack shortly after a robbery could be represented by the profile
shown in the following table.

CONCEPTS                  CITED CASES
felony-murder             People v. Cahill (1995) 22 Cal.App.4th 296

CITED STATUTES            FACTS
Crim. Code § 187          convenience store
                          robbery
                          heart attack


Different cases dealing with the same legal issues but different fact patterns can be
represented by changing the significant terms of the facts quadrant. In the same way,
the terms of the facts quadrant can remain fixed, but different concepts, case
citations, and statutory provisions can be used to represent cases on similar facts but
dealt with in terms of different legal doctrines.
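
By way of illustration only, the four-quadrant profile described above might be held in a simple data structure such as the following. This is a minimal sketch and not the FLEXICON implementation; the names DocumentProfile, felony_murder_example and with_facts are hypothetical and are introduced here solely for the example.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical four-quadrant profile of a legal document.
struct DocumentProfile {
    std::vector<std::string> concepts;        // legal concepts
    std::vector<std::string> cited_cases;     // case references
    std::vector<std::string> cited_statutes;  // statute references
    std::vector<std::string> facts;           // significant factual terms
};

// Builds the felony-murder example profile given in the table.
DocumentProfile felony_murder_example() {
    DocumentProfile p;
    p.concepts.push_back("felony-murder");
    p.cited_cases.push_back("People v. Cahill (1995) 22 Cal.App.4th 296");
    p.cited_statutes.push_back("Crim. Code § 187");
    p.facts.push_back("convenience store");
    p.facts.push_back("robbery");
    p.facts.push_back("heart attack");
    return p;
}

// Returns a copy of `base` whose facts quadrant is replaced while the
// three legal quadrants are left unchanged.
DocumentProfile with_facts(DocumentProfile base, std::vector<std::string> facts) {
    base.facts = std::move(facts);
    return base;
}
```

Replacing only the facts quadrant in this way models a case on the same legal issues but a different fact pattern; replacing the other three quadrants while keeping the facts models the converse.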

A legal text management tool that exploits this form of representation of legal
material is FLEXICON (Fast Legal EXpert Information CONsultant), which has
been described, for example, in the following articles, the disclosures of which are
incorporated herein by this reference: Gelbart, D. et al., "Flexicon: An Evaluation
Of A Statistical Ranking Model Adapted To Intelligent Legal Text Management",
Proceedings of the Fourth International Conference on Artificial Intelligence and
Law, University of British Columbia, Vancouver, Canada (1993); Gelbart, D. et al.,
"Beyond Boolean Search: FLEXICON, A Legal Text-Based Intelligent System",
Proceedings of the Third Conference on Artificial Intelligence & Law, University of
British Columbia, Vancouver, Canada, pp. 225-234 (1991); Gelbart, D. et al., "Toward
A Comprehensive Legal Information Retrieval System", Database and Expert
Systems Applications, Proceedings of the International Conference, Vienna, Austria,
pp. 121-125 (1990); Gelbart, D. et al., "Current Issues in Text Retrieval:
FLEXICON, A Legal Text-Based Intelligent System", University of British Columbia,
Faculty of Law Artificial Intelligence Research Project, Vancouver, Canada, pp. 1-5
(1989); Gelbart, D. et al., "Towards Combining Automated Text Retrieval and Case-
Based Expert Legal Advice", Law Technology Journal, CTI Law Technology Centre
and Bileta, Vol. 1, No. 2, pp. 19-24 (1992); Gelbart, D. et al., "Effective Legal
Information Retrieval Systems", Council of Canada and The Law Foundation of
British Columbia; and Gelbart, D. et al., "Flexicon: A New Legal Information
Retrieval System", Canadian Law Libraries, Bibliotheques de Droit Canadiennes,
Vol. 16, No. 1, pp. 9-12 (1991).



SUMMARY OF THE INVENTION



In a system such as FLEXICON, significant preprocessing of the cases or other legal
documents is necessary to extract the information required to fill out the documents'
profiles. The invention provides computer-implemented methods useful in
automating this preprocessing. Complementary to this preprocessing, the invention
also provides methods useful in ranking the results of a search where the query is
formulated in terms of document profile categories.

In general, in one aspect, the invention provides a method for matching a current
case reference to a set of case references. The method includes maintaining a
database of case references that have been processed by the method, parsing the
current reference for its citations, and for each citation, parsing the citation into its
volume, reporter, and page, and searching the database for a matching case by
applying a set of tests to the cases in the database, the tests including: if a candidate
case matches the current case in two volume-reporter-page citations from different
reporters, the candidate case is a match. In another aspect, the invention includes
parsing the current case reference for its party names, and for each non-noise word
in each party name, acquiring a sound-alike value for the non-noise word; and
searching the database for a matching case by applying a set of tests including: if a
candidate case matches the current case in one citation, and the cases' names match
at least loosely, and neither case has a citation that is inconsistent with the citations
of the other case, the candidate case is a match.

In another aspect, the invention includes the steps of initializing a candidate set to
be empty; adding to the candidate set a case reference from the database if it has a
same party name as the current reference, by sound-alike values; adding to the
candidate set a case reference from the database if the case reference has a
volume-reporter-page citation matching the current citation; and searching the
candidate set rather than the entire database of case references for a matching case
by applying a set of tests to the cases in the candidate set. In another aspect, the
method includes applying a set of tests including: if a candidate case matches the
current case in two volume-reporter-page citations from different reporters and the
court information for the two cases is not inconsistent, the candidate case is a match;
if a candidate case matches the current case in two volume-reporter-page citations
from different reporters, and the year information is not inconsistent, the candidate
case is a match; and if a candidate case matches the current case in one citation and
both the court information and the year information for the two cases are not
inconsistent, and neither case has a citation that is inconsistent with the citations of
the other case, the candidate case is a match. In another aspect, the invention
includes applying tests including: if a candidate case matches the current case in one
citation, and the courts and the years also match, the cases' names match very
tightly, and the cases have fewer than two citations that are inconsistent from one
case to the other, the candidate case is a match; and, if a candidate case matches the
current case in both courts and years, the cases' names match very tightly, and
neither case has a citation that is inconsistent with the citations of the other case,
the candidate case is a match.

In general, in another aspect, the invention provides a method for ranking the
relevance of a target document found by a search query in a set of documents. The
method includes providing a set of weighting factors defined by a user for the search
query, at least one of which weighting factors differs from the others and at least one
of which weighting factors has a negative value; applying a metric function to the
search query, the weighting factors, and the target document to produce a similarity
measure; and ranking the target document by its similarity measure. In another
aspect, the method includes the calculation of an inner product as part of the metric
function.
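
By way of illustration only, a metric function built on an inner product of user-defined weights and document term frequencies might be sketched as follows. The choice of raw term frequency as the document vector is an assumption made for this example, and the names TermWeights, TermCounts, similarity and rank_documents are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, double> TermWeights;  // query term -> user weight
typedef std::map<std::string, int> TermCounts;      // document term -> frequency

// Inner product of the query weights and the document's term frequencies.
// A negative weight lowers the score of documents containing that term.
double similarity(const TermWeights& query, const TermCounts& doc) {
    double score = 0.0;
    for (TermWeights::const_iterator q = query.begin(); q != query.end(); ++q) {
        TermCounts::const_iterator d = doc.find(q->first);
        if (d != doc.end())
            score += q->second * d->second;
    }
    return score;
}

// Returns document indices ordered from most to least similar.
std::vector<std::size_t> rank_documents(const TermWeights& query,
                                        const std::vector<TermCounts>& docs) {
    std::vector<double> scores;
    for (std::size_t i = 0; i < docs.size(); ++i)
        scores.push_back(similarity(query, docs[i]));
    std::vector<std::size_t> order(docs.size());
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i] = i;
    std::stable_sort(order.begin(), order.end(),
                     [&scores](std::size_t a, std::size_t b) {
                         return scores[a] > scores[b];
                     });
    return order;
}
```

With weights of 2.0 for "robbery", 1.0 for "heart attack" and -1.5 for "self-defence", a document mentioning self-defence twice is ranked below one that mentions only the two positively weighted terms.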

In general, in another aspect, the invention provides a method for ranking the
relevance of a target document found by a search query in a set of documents, where
the document terms and the search terms are each of one of a plurality of types.
The method includes, for each type in the plurality of types, applying a metric
function to those terms of the search query and the target document having that type,
to produce a type-based similarity measure; combining the type-based similarity
measures by applying a user-selectable type-based weight to each of the target
document's type-based similarity measures to produce a final similarity measure; and
ranking the target document by the final similarity measure. In another aspect, the
method includes combining the type-based similarity measures and applying a factor
to the similarity measures based on the number of different types for which some search
term matched the target document. In another aspect, the method includes providing
a set of weighting factors for each search query, at least one of which weighting
factors differs from the others, and applying a metric function to the search query,
the weighting factors, and the target document to produce a similarity measure.
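
By way of illustration only, the combination of type-based similarity measures might be sketched as follows. The per-type metric used here (a count of shared terms) and the form of the multi-type factor (a linear bonus per additional matched type) are assumptions made for this example; the names TermType, TypedTerms, type_similarity and final_similarity are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <set>
#include <string>

// The four profile quadrants, used as term types.
enum TermType { CONCEPT = 0, CASE_CITE, STATUTE_CITE, FACT, NUM_TYPES };

typedef std::array<std::set<std::string>, NUM_TYPES> TypedTerms;

// Assumed per-type metric: the number of query terms also found in the document.
double type_similarity(const std::set<std::string>& query,
                       const std::set<std::string>& doc) {
    double shared = 0.0;
    for (std::set<std::string>::const_iterator t = query.begin(); t != query.end(); ++t)
        if (doc.count(*t) != 0)
            shared += 1.0;
    return shared;
}

// Weighted sum of the per-type measures, scaled by a factor that grows
// with the number of distinct types on which the document matched.
double final_similarity(const TypedTerms& query, const TypedTerms& doc,
                        const std::array<double, NUM_TYPES>& type_weight,
                        double per_type_bonus) {
    double sum = 0.0;
    int matched_types = 0;
    for (std::size_t t = 0; t < NUM_TYPES; ++t) {
        double s = type_similarity(query[t], doc[t]);
        if (s > 0.0)
            ++matched_types;
        sum += type_weight[t] * s;
    }
    double factor = 1.0;
    if (matched_types > 1)
        factor += per_type_bonus * (matched_types - 1);
    return sum * factor;
}
```

Under these assumptions, a document matching one concept and one fact scores higher than a document matching two facts, because the former matches on two types.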

The invention has a number of advantages.

For example, the case matching method provides ideal case citations (case
references) that improve the accuracy and usefulness of case name- and citation-
based searches and hypertext links and that, when used with automatic case
extraction, in effect correct extraction errors. The case matching method, by
recognizing references to the same case, improves the usefulness of statistics that
characterize a legal text by the frequency with which it cites cases. Similarly, the
method improves the quality of statistics taken on a database of cases as a whole.
Also, the search result ranking method improves the usefulness of search results by
allowing the user (requestor) to weigh the importance of terms in a search request and
to increase the ranking of documents that match on multiple quadrants.

Other advantages and features will become apparent from the following description
and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS 

The accompanying drawings, which are incorporated in, and constitute a part of, the
specification, schematically illustrate specific embodiments of the invention and,
together with the general description given above and the detailed description of the
embodiments given below, serve to explain the principles of the invention.

FIGURE 1 is a block diagram of a data processing arrangement supporting a method
for processing legal text.

FIGURE 2 is a flowchart of a method for matching case references.

FIGURE 3 is a flowchart of a method for ranking cases found in a search.

DETAILED DESCRIPTION

Turning to FIGURE 1, a preprocessor for a legal text management system is
implemented as a computer program on a computer 10 coupled to mass data store
12 on which are stored one or more sets of data, preferably stored in one or more
databases managed by a database management system (not shown). Also available
for use by a user (generally human, but possibly also an application program) is a
computer interface 18 through which search requests can be made of the data on
data store 12, and through which information can be provided to the user.

The preprocessor takes as input a document 14 (which may be a data file, part of a
data file, or other form of text input) containing the text of a judicial opinion (a case)
or other legal text. (For clarity of exposition, the following description is given solely
in terms of a case; however, it will be clear that the methods described are applicable
to the processing of other kinds of legal material.)

Document 14 may have formatting information (such as underlining marks) in it in
addition to plain text. Such information may be useful in preprocessing the
document (for example, the names of cases cited in judicial opinions are often
written in italics or underlined), but the methods described herein are intended for
fully general application and so do not rely on such non-textual information.

The preprocessor recognizes and extracts from the text of document 14 case
references and statute references, as will be described. A full case reference
normally has a case name, which is normally formed from the names of the parties
(e.g., "Lochner v. New York"), one or more citations, each indicating a volume
number, reporter, and page number (e.g., "198 U.S. 45"), and a date. (The term
"citation" is used in two senses: first, to refer to the entire reference, including the
case name; and second, more specifically, to refer to the volume-reporter-page
information. Where appropriate to avoid ambiguity, the unidiomatic terms "case
reference" or "reference" will be used to denote the former.)

Turning to FIGURE 2, the preprocessor parses the case name at step 210 and the
case citation (or citations) at step 212. Cited cases (case references) are recognized
by the preprocessor by means of template matching. Both the form of the case
names ("v." and "In re" primarily) and the form of the case citations (the familiar
"volume, reporter, page" sequence) are matched by different template matching
mechanisms. If a portion of text matches a case name template, the preprocessor
searches subsequent to the name for citations to reporters. If a reporter citation is
found first, the preceding text is examined to see if it contains a case name in a form
not recognized by the general purpose case name template matcher. This process
exploits the conventional use of capitalization. The case names are parsed into
names of parties; and the citations are parsed into volume, reporter, page, court, and
year, to the extent the information appears in the case reference.
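
By way of illustration only, template matching of the "volume, reporter, page" sequence and the splitting of a "v." case name into party names might be sketched with regular expressions as follows. The particular pattern is an assumption made for this example and is far simpler than a production template set; the names Citation, parse_citation and split_parties are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

struct Citation {
    int volume;
    std::string reporter;
    int page;
};

// Finds the first "volume reporter page" sequence in `text`.
// Assumed template: digits, a reporter abbreviation starting with a
// capital letter (possibly several words), then digits.
bool parse_citation(const std::string& text, Citation& out) {
    static const std::regex tmpl(
        R"((\d+)\s+([A-Z][A-Za-z0-9.]*(?:\s[A-Za-z0-9.]+)*?)\s+(\d+))");
    std::smatch m;
    if (!std::regex_search(text, m, tmpl))
        return false;
    out.volume = std::stoi(m[1].str());
    out.reporter = m[2].str();
    out.page = std::stoi(m[3].str());
    return true;
}

// Splits a case name of the form "A v. B" into its two party names.
bool split_parties(const std::string& name, std::string& first, std::string& second) {
    const std::string sep = " v. ";
    std::size_t pos = name.find(sep);
    if (pos == std::string::npos)
        return false;
    first = name.substr(0, pos);
    second = name.substr(pos + sep.size());
    return !first.empty() && !second.empty();
}
```

Applied to "Lochner v. New York, 198 U.S. 45", the citation template yields volume 198, reporter "U.S." and page 45. Court and year parsing, "In re" names, and the capitalization heuristics mentioned above are omitted from this sketch.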
Once a case has been found in a document, subsequent references to the case in the
document by means of shortened forms of the case name (e.g., "... in the Smith
case, supra, ...") are recognized. (The preprocessor assumes that a party name from
a previously cited case (unless it is the name of a party in the present case), when it
appears in the text, is a reference to the previously cited case.)

When a case reference has been extracted, it is added to a database 16 of case
references, as will be described. As each new reference is added to the database, a
check is made for a match with the references already in the database. If one is
found, steps 216 and 218, any new information about the case from the current
reference (e.g., a parallel citation to an unofficial reporter) is added to the database
record at step 220; otherwise, a new record is made for the reference at step 222.
(The record for a reference may be stored as a single record, in the technical
database sense, but for purposes of this description the term should be understood
more broadly to include any coherent set of data, stored and accessible in the
database, that includes information about the case to which the case reference
refers.) At optional step 214, a subset of the database (a candidate set of cases)
is extracted before matching is done at step 216, to include cases that match on any
of the parts of the case reference. The extraction of a candidate set is done, in one
embodiment, only for the sound-alike terms (which will be described), for the sake
of efficiency.
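
By way of illustration only, the extraction of a candidate set at optional step 214 might be sketched as follows. The sketch scans the whole database linearly for clarity, whereas a practical system would presumably use database indexes; the record layout and the names CaseRecord, shares_any and candidate_set are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified case record.
struct CaseRecord {
    std::vector<std::string> name_codes;  // sound-alike codes of party-name words
    std::vector<std::string> citations;   // normalized "volume reporter page" strings
};

// True if the two lists have at least one element in common.
static bool shares_any(const std::vector<std::string>& a,
                       const std::vector<std::string>& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::find(b.begin(), b.end(), a[i]) != b.end())
            return true;
    return false;
}

// Returns the indices of database records that share a sound-alike
// party-name code or a citation with the current reference.
std::vector<std::size_t> candidate_set(const std::vector<CaseRecord>& db,
                                       const CaseRecord& current) {
    std::vector<std::size_t> candidates;
    for (std::size_t i = 0; i < db.size(); ++i)
        if (shares_any(db[i].name_codes, current.name_codes) ||
            shares_any(db[i].citations, current.citations))
            candidates.push_back(i);
    return candidates;
}
```

Only the records returned by candidate_set would then be passed to the more expensive matching tests of step 216.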

The matching of case references (step 216) against the database 16 is done with
heuristic algorithms that rely principally on citations and secondarily on names.
Thus, for each citation in the reference, a search may be made of the database and
the results evaluated by the case matching filter, described below. If no match is
found for one citation, the next citation, if any, of the reference is considered until
all citations have been tried.

If no matches are found based on the reference's citation(s), the preprocessor filter
(step 216) tries the reference's party names, as follows. A sound-alike value is
created for each non-noise word of each party name using, for example, any one of
the class of algorithms commonly known as "soundex algorithms". These associate
a unique number (which will be called the "soundex value") with words that sound
alike. (A frequently used example is "Smith" and "Smythe".) The database record
for a case reference includes a soundex value for each non-noise part of each party
name. The party name soundex values for the current reference are used to search
the database. The records retrieved are sorted based on number of matches, and
some number of the candidates from the top of the sort order are tested with the
case matching filter until either a match is found or the candidates are exhausted.
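
By way of illustration only, one member of the class of soundex algorithms, the widely used American Soundex, might be sketched as follows. Nothing here should be taken as the particular variant used by the preprocessor; the function names soundex_digit and soundex are hypothetical, and the value produced is a letter followed by three digits rather than a pure number.

```cpp
#include <cctype>
#include <string>

// Digit class of a letter; '0' means the letter is not coded
// (vowels, h, w, y, and any non-letter).
char soundex_digit(char c) {
    switch (std::tolower(static_cast<unsigned char>(c))) {
        case 'b': case 'f': case 'p': case 'v':
            return '1';
        case 'c': case 'g': case 'j': case 'k':
        case 'q': case 's': case 'x': case 'z':
            return '2';
        case 'd': case 't':
            return '3';
        case 'l':
            return '4';
        case 'm': case 'n':
            return '5';
        case 'r':
            return '6';
        default:
            return '0';
    }
}

// American Soundex: first letter plus up to three digits, zero padded.
std::string soundex(const std::string& word) {
    std::string code;
    char prev = '0';
    for (std::string::size_type i = 0; i < word.size() && code.size() < 4; ++i) {
        unsigned char c = static_cast<unsigned char>(word[i]);
        if (!std::isalpha(c))
            continue;
        char d = soundex_digit(word[i]);
        if (code.empty())
            code += static_cast<char>(std::toupper(c));
        else if (d != '0' && d != prev)
            code += d;
        // h and w do not separate letters of the same digit class.
        char lower = static_cast<char>(std::tolower(c));
        if (lower != 'h' && lower != 'w')
            prev = d;
    }
    code.resize(4, '0');
    return code;
}
```

With this variant, "Smith" and "Smythe" receive the same value, so a database search on that value retrieves records for either spelling.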

The main module of a case matching filter is implemented in the C++ method
same_case set forth in the table below, and in the routines it calls, whose functions
are described by their names and the comments associated with them:



/* This function returns TRUE if the two cases passed to it are the same
   and otherwise it returns FALSE. */

int case_table_type::same_case(full_case_info_type *x,
                               case_info_type *y)
{
    int common_cites;
    int same_date;
    int same_court;
    int name_match;
    int inconsistent_cites;
    int all_cites_match;

    common_cites = cites_in_common(x, y, &inconsistent_cites,
                                   &all_cites_match);
    same_date = compare_dates(x, y);
    same_court = compare_courts(x, y);
    name_match = compare_names(x, y);

    if (common_cites >= 2 && same_court)
        return (TRUE);
    if (common_cites >= 2 && same_date)
        return (TRUE);
    if (common_cites == 1 && name_match >= LOOSE && inconsistent_cites < 1)
        return (TRUE);
    if (common_cites == 1 && same_date && same_court && name_match ==
        VERY_TIGHT && inconsistent_cites < 2)
        return (TRUE);
    if (common_cites == 1 && same_date && same_court && all_cites_match)
        return (TRUE);
    if (name_match == VERY_TIGHT && same_date && same_court &&
        inconsistent_cites < 1)
        return (TRUE);
    return (FALSE);
}

/* This function compares the dates of two cases and returns TRUE if
   they are the same and FALSE otherwise. A year of UNKNOWN in either
   case is treated as a match. */

int case_table_type::compare_dates(full_case_info_type *x,
                                   case_info_type *y)
{
    if (x->year == y->year || y->year == UNKNOWN || x->year == UNKNOWN)
        return (TRUE);
    return (FALSE);
}

/* This function returns the number of cites the two cases passed to it
   have in common, and also returns the number of inconsistent cites the
   two cases have in common. Inconsistent cites are two different cites to
   the same reporter. If the case with the lesser number of cites all match
   cites of the other case, then all_match is TRUE, otherwise FALSE. */

int case_table_type::cites_in_common(full_case_info_type *x,
        case_info_type *y, int *incon_cites, int *all_match)
{
    int common_cites = 0;
    int inconsistent_cites = FALSE;
    cite_type x_cites[MAX_CITES], y_cites[MAX_CITES];
    int xi = 0;
    int yi = 0;
    int min_no_cites;
    streamoff y_pos;
    disk_link_cite_type y_ct;
    ram_link_cite_type *rl;

    rl = &(x->cite);
    while (rl != NULL && rl->cite.valid()) {    // read x cites
        x_cites[xi] = rl->cite;
        rl = rl->next;
        xi++;
    }
    x_cites[xi].invalidate();
    y_pos = y->cites;                           // read y cites
    while (y_pos != END) {
        cite_ptr_file.seekg(y_pos, ios::beg);
        cite_ptr_file >> y_ct;
        y_cites[yi] = y_ct.cite;
        y_pos = y_ct.next_cite;
        yi++;
    }
    y_cites[yi].invalidate();

    min_no_cites = (xi < yi) ? xi : yi;
    if (xi != 0 || yi != 0)
        inconsistent_cites = CITES_EXIST;
    xi = yi = 0;                                // compare cites
    while (x_cites[xi].valid()) {
        while (y_cites[yi].valid()) {
            if (x_cites[xi] == y_cites[yi]) {
                common_cites++;
            }
            else if ((x_cites[xi].normal.unused == NO_YEAR &&
                      y_cites[yi].normal.unused == NO_YEAR &&
                      x_cites[xi].normal.reporter ==
                      y_cites[yi].normal.reporter) ||
                     (x_cites[xi].normal.unused != NO_YEAR &&
                      y_cites[yi].normal.unused != NO_YEAR &&
                      x_cites[xi].normal.reporter ==
                      y_cites[yi].normal.reporter)) {
                inconsistent_cites = TRUE;
            }
            yi++;
        }
        yi = 0;
        xi++;
    }
    if (x_cites[0].valid() && y_cites[0].valid() && common_cites == 0)
        inconsistent_cites = NO_CITE_MATCH;
    *incon_cites = inconsistent_cites;
    if (min_no_cites > 0 && min_no_cites == common_cites)
        *all_match = TRUE;
    else
        *all_match = FALSE;
    return (common_cites);
}

/* This function returns TRUE if two courts are the same and FALSE
   otherwise. Two courts are counted as being the same if all of the
   shorter court abbreviation's letters are found in the same order in the
   longer court abbreviation. Therefore 'B.C.S.C.' would match with 'S.C.'
   or 'SC'. */

int case_table_type::compare_courts(full_case_info_type *x,
                                    case_info_type *y)
{
    char ct[MAX_COURT_LEN];
    char *cp1, *cp2;

    if (y->court == NO_COURT || x->court[0] == '\0')
        return (TRUE);
    if (y->court == court->court_number(x->court))
        return (TRUE);
    court->court_string(ct, y->court);
    cp1 = ct;
    cp2 = x->court;
    if (number_letters(cp1) > number_letters(cp2)) {
        cp1 = x->court;                 // cp1 is the shorter abbreviation
        cp2 = ct;
    }
    while (*cp1 != '\0') {
        if (*cp2 == '\0')
            return (FALSE);
        if (*cp1 == *cp2) {
            cp1++;
            cp2++;
        }
        else {
            cp2++;
        }
        while (*cp1 == ' ' || *cp1 == '.')
            cp1++;
        while (*cp2 == ' ' || *cp2 == '.')
            cp2++;
    }
    // every letter of the shorter abbreviation was found in order
    return (TRUE);
}

/* This function compares names of cases. If the names are very close it
   returns TIGHT. If the names match somewhat it returns LOOSE, otherwise
   it returns FALSE. */

int case_table_type::compare_names(full_case_info_type *x,
                                   case_info_type *y)
{
    char name[MAX_CASE_LEN];
    streamoff name_ptr;
    case_name_ptr_type name_link;
    int match, temp_match;

    match = FALSE;
    name_ptr = y->names;
    do {
        name_ptr_file.seekg(name_ptr, ios::beg);
        name_ptr_file >> name_link;
        names_file.seekg(name_link.case_name_ptr, ios::beg);
        names_file.get(name, MAX_CASE_LEN, '\n');
        if ((temp_match = namecmp(x->name, name)) != FALSE) {
            if (temp_match > match)
                match = temp_match;
        }
        name_ptr = name_link.next_case_ptr;
    } while (name_ptr != END);
    return (match);
}

/* This function compares two strings. If all words in the shorter
   string are found in the longer string in the same order, then it returns
   TIGHT. If >= 3 words are the same in the same order it returns LOOSE. If
   > 60% of words match it also returns LOOSE. Else it returns FALSE. */

int case_table_type::namecmp(char *a, char *b)
{
    char n1[CASE_NAME_LEN], *n2, *temp;
    char *w1;
    int wd_mtch = 0;
    int wds = 0;
    int match = TIGHT;

    if (strstri(a, b) != NULL)
        return (VERY_TIGHT);
    if (strlen(a) < strlen(b)) {
        strcpy(n1, a);
        n2 = b;
    }
    else {
        strcpy(n1, b);
        n2 = a;
    }
    w1 = strtok(n1, ";: ,.-");
    while (w1 != NULL) {
        if (words->real_word(w1)) {
            if ((temp = strstri(n2, w1)) != NULL) {
                wd_mtch++;
                n2 = temp;
            }
            else {
                match = LOOSE;
            }
            wds++;
        }
        w1 = strtok(NULL, ";: ,.-");
    }
    if (match && wds != 0) {
        if (match == TIGHT && wd_mtch > 1)
            return (TIGHT);
        if (wd_mtch >= 4 && ((float) wd_mtch / (float) wds > .8))
            return (QUITE_TIGHT);
        if (wd_mtch >= 1 || ((float) wd_mtch / (float) wds > .5))
            return (LOOSE);
    }
    return (FALSE);
}



If the case matching filter finds no match (i.e., returns FALSE) for all candidate
records, a new record is added to the database, as has been mentioned.

If, on the other hand, a match is found, any new information about the case from the
current reference (e.g., a parallel citation to an unofficial reporter) is added to the
record in the database 16 for the case reference. When information is added in this
way to a case in the database, this newly extended case record is used as the search
case for another search of the database, steps 224 and 226, using the same matching
filter described above. If exactly one matching case is found, the current case record
is merged with the matching case record, step 228; if more than one matching case
is found, all matching cases are merged into one record in the database.

The database 16 so generated can be further processed to generate "ideal" case
references that can be used in processing search queries, building hypertext links, and
so on. Each case record in the database includes all variants of the case name that
have been encountered and a use count for each variant. In one embodiment, the
variant that appears most frequently is chosen as the ideal case name. In alternative
embodiments, the ideal case name must also meet other criteria, such as having a
particular form, for example "A v. B" or "In re C". The ideal citation form for a
reference will be, in one embodiment, a standard form, such as "Blue Book" or
California Style Manual form. In another embodiment, all known parallel citations are
included in the ideal citation. This allows the preprocessor to provide alternate
citations to a cited case even if such citations are not present in the case being
viewed, if the alternate citation appears somewhere in the database of cases. In
addition, in this process the form of case citations is corrected by the preprocessor
so that it accords with generally accepted citation rules.
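
By way of illustration only, the selection of the most frequently used name variant as the ideal case name might be sketched as follows. The function name ideal_case_name and the use of a map from variant to use count are assumptions made for this example; ties are broken here in favor of the alphabetically first variant, which the description above does not specify.

```cpp
#include <map>
#include <string>

// Returns the name variant with the highest use count, or an empty
// string if no variants have been recorded.
std::string ideal_case_name(const std::map<std::string, int>& use_counts) {
    std::string best;
    int best_count = 0;
    for (std::map<std::string, int>::const_iterator it = use_counts.begin();
         it != use_counts.end(); ++it) {
        if (it->second > best_count) {
            best = it->first;
            best_count = it->second;
        }
    }
    return best;
}
```

An embodiment requiring a particular form, such as "A v. B", could filter the variants before applying the same selection.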

Statute citations are recognized by the preprocessor through template matching. The
templates are largely based on conventional citation formats for the various statutes.
Thus, unconventional methods of citing statutes by individual judges may affect the
accuracy of statute extraction. A certain range of formats can be tolerated, as
different templates can be set up to deal with discrepancies in the format of citations
to statutes. Citations in an unconventional format are converted to the appropriate
generally-accepted format.

The use of matching templates relies to some degree on a knowledge of the
jurisdiction of the case. For example, a reference to the "Evidence Code" in a
California case means a different thing than an identical reference in a New York case.
These problems are taken care of by matching some templates only if the case
belongs to certain jurisdictions. Also, matching can occur based on the use of a word
such as "Act" or "Code" in the case (as in "Copyright Act" or "Bankruptcy Code").

References to sections (and subsections) of statutes are also identified by template
matching. These references are associated with a statute primarily on the basis of
the proximity of the section reference in the text to a reference to a statute.

Legal concepts are automatically extracted from text by matching sections of the text
with terms contained in a legal concept dictionary, which is constructed by hand.
The dictionary is a domain lexicon of words or phrases used by legal professionals.
Each term in the dictionary consists of the stems of one or more words, which are
matched to the text, and a legal concept, to which the stem is linked. Most concept
phrases in the dictionary are associated with more than one stem, thus allowing users
to retrieve documents containing terms that are synonymous or semantically similar
to concept phrases selected as search terms by the user. The dictionary also
distinguishes between entries that require an exact match versus entries that allow
the matched information to appear in the text in any order, to be suffixed, or to be
separated by noise words.
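
By way of illustration only, the linking of several stems to one legal concept might be sketched as follows. The sketch handles single-word stems only and treats a stem as matching any word that begins with it; multi-word entries, exact-match entries and noise-word handling are omitted. The dictionary contents and the function name extract_concepts are hypothetical.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>

// Returns the set of concepts whose stems match some word of `text`.
// `stem_to_concept` maps a lowercase stem to the concept it is linked to.
std::set<std::string> extract_concepts(
        const std::string& text,
        const std::map<std::string, std::string>& stem_to_concept) {
    std::set<std::string> found;
    std::string word;
    for (std::string::size_type i = 0; i <= text.size(); ++i) {
        unsigned char c = (i < text.size()) ? static_cast<unsigned char>(text[i]) : ' ';
        if (std::isalpha(c)) {
            word += static_cast<char>(std::tolower(c));
            continue;
        }
        if (word.empty())
            continue;
        // End of a word: test it against every stem in the dictionary.
        for (std::map<std::string, std::string>::const_iterator s = stem_to_concept.begin();
             s != stem_to_concept.end(); ++s) {
            if (word.compare(0, s->first.size(), s->first) == 0)
                found.insert(s->second);
        }
        word.clear();
    }
    return found;
}
```

Because both "confess" and "admiss" may be linked to the same concept, a search on that concept retrieves documents using either word family.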

Concepts represent ideas. A legal concept generally gets its meaning from its
relationship to a set of concepts which together constitute a legal doctrine or a
procedural practice. Whether or not a word or phrase is, or represents, a legal
concept ought to be measured in terms of its relationship to a particular doctrinal
structure or sets of structures. For example, "confession" is a legal concept because
it has a semantic relationship to a doctrinal structure about responsibility and
evidence, which in turn is part of a doctrinal sub-set of criminal law.

Since almost any kind of idea constitutes a concept of some kind or other, no sharp
distinction can be drawn between the legal sub-language and the language of natural
discourse. Some of the concepts of legal discourse are highly technical and unique
to law; others have a great deal to do with the law, but are widely used in general
discourse. Still other concepts are non-legal but are, in fact, extensively used in the
discourse of law. The goal in constructing a concept dictionary is to include in it only
concepts that function as signifiers in that they are generally correlated with factual
concepts that function as the signified.

The legal concept dictionary is constructed from a wide variety of legal sources such
as legal dictionaries, thesauri, statutes, indexes, learned authorities, and treatises.
The dictionary includes synonym information and allows (when applicable) the
matched information to appear in the text in any order, to be suffixed, and to have
the words in the concept phrase separated by noise words.
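
By way of illustration only, the distinction between exact and loose dictionary
entries might be sketched in Python as follows. The noise-word list, the crude
`stem` helper, and the function name `matches_entry` are assumptions made for this
sketch; the text does not specify how entries are stored or how suffixes are handled.

```python
import re

# Hypothetical noise-word list; the text does not enumerate one.
NOISE = {"the", "of", "a", "an", "to", "in", "and"}

def stem(word):
    """Crude suffix stripper standing in for real suffix handling (an assumption)."""
    for suffix in ("ions", "ion", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def matches_entry(text, phrase, exact):
    """Return True if the dictionary phrase is found in the text.

    exact=True  : the phrase words must appear contiguously, in order, unsuffixed.
    exact=False : the words may appear in any order, may be suffixed, and may
                  be separated by noise words.
    """
    words = tokenize(text)
    target = tokenize(phrase)
    if exact:
        n = len(target)
        return any(words[i:i + n] == target for i in range(len(words) - n + 1))
    # Loose match: drop noise words, stem, then compare windows ignoring order.
    content = [stem(w) for w in words if w not in NOISE]
    wanted = sorted(stem(w) for w in target if w not in NOISE)
    if not wanted:
        return False
    n = len(wanted)
    return any(sorted(content[i:i + n]) == wanted
               for i in range(len(content) - n + 1))
```

Under these assumptions, a loose entry "accused confession" would be found in the
text "the confessions of the accused", while the same entry marked as exact would not.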

As has been mentioned, a typical case contains a factual story, a description of the
set of legal issues which the story gives rise to, a statement of the applicable law, and
a resolution of the issues when the law has been applied to the facts. After the
removal of legal concepts that are recognized by the legal concept dictionary, and of
case and statute citations that are recognized by template matching functions, the
remaining text represents the facts of the case. Single word fact terms can be
generated by removing noise words as determined by a noise word list. However, an
improved indexing and query representation can be achieved by incorporating
multi-word fact terms in both document and query representations. Unlike the recognition
of legal concept phrases, which is dictionary based, the recognition of fact phrases is
based on automatic sentence construct analysis. The underlying technology is
described in the following references, whose disclosures are included here by this
reference: Dillon, M. and Gray, S., FASIT: A Fully Automatic Syntactically Based
Indexing System, Journal of the American Society of Information Science 34 (1983);
Fagan, J., Experiments in Automatic Phrase Indexing for Document Retrieval: A
Comparison of Syntactic and Non-Syntactic Methods, Ph.D. Thesis, Technical Report
87-868, Cornell University, Computer Science Department (1987); Smeaton, A.,
Information Retrieval Research: How It Might Affect the Practicing Lawyer, in S.
Nagel, ed., Law, Decision-making and Micro Computers (1991); and Croft, W.B.,
Turtle, H.R., and Lewis, D.D., The Use of Phrases and Structured Queries in
Information Retrieval, in Proceedings of the ACM SIGIR Conference on Research and
Development in Information Retrieval (1991).

The preprocessor recognizes multi-word fact phrases by combining term distribution
and proximity information with a lexicon of noise words. It defines categories of
noise words and uses them as "glue" connecting fact terms into phrases. "Joiners"
(e.g., "by", "or") join two fact terms; "modifiers" (e.g., "extended", "civil") qualify or
constrain terms; and "pure noise" is eliminated. The preprocessor also uses a set of
rules to identify classes of noise terms such as names and numbers that are retained
in the context of citations. While this simple approach to phrase recognition can
result in some meaningless, useless, or over-specific terms, their occurrence is
minimized by the use of corpus filtering; i.e., the elimination of terms whose
collection frequency falls below a given threshold, and by applying constraints on the
length of automatically determined terms.
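
By way of illustration only, the use of noise-word categories as glue, followed by
corpus filtering, might be sketched in Python as follows. This is a simplified
reading that ignores term distribution and proximity statistics. The lexicon
members (other than the examples quoted above), the `max_len` and `min_freq`
parameters, and the function names are assumptions of the sketch.

```python
import re
from collections import Counter

# Hypothetical lexicon: the categories come from the text; the members do not,
# apart from the quoted examples "by", "or", "extended", and "civil".
JOINERS = {"by", "or"}
MODIFIERS = {"extended", "civil"}
PURE_NOISE = {"the", "a", "an", "was", "is", "and", "that", "it"}
GLUE = JOINERS | MODIFIERS

def fact_phrases(text, max_len=4):
    """Split text into candidate fact phrases, using noise words as glue."""
    phrases, current = [], []

    def flush():
        # Drop trailing joiners/modifiers so a phrase always ends on a fact term.
        while current and current[-1] in GLUE:
            current.pop()
        # Length constraint: discard phrases longer than max_len words.
        if current and len(current) <= max_len:
            phrases.append(" ".join(current))
        current.clear()

    for word in re.findall(r"[a-z]+", text.lower()):
        if word in PURE_NOISE:
            flush()                  # pure noise ends the current phrase
        elif word in JOINERS:
            if current:              # a joiner needs a term on its left
                current.append(word)
        else:
            current.append(word)     # fact term or modifier
    flush()
    return phrases

def corpus_filter(docs_phrases, min_freq=2):
    """Keep only phrases whose collection frequency reaches min_freq."""
    freq = Counter(p for phrases in docs_phrases for p in phrases)
    return [[p for p in phrases if freq[p] >= min_freq]
            for phrases in docs_phrases]
```

With this hypothetical lexicon, the sentence "The assault by police was a civil wrong"
yields the phrases "assault by police" and "civil wrong"; a phrase that occurs only
once in the collection would then be dropped by `corpus_filter` at the default threshold.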

Returning to FIGURE 1, with the concepts, case references, statutes, and facts
extracted from a database of cases 15 (e.g., one representing a body of case law),
the database of cases 15 can be searched with a query having terms of one or more
types or categories (i.e., from one or more of the profile quadrants). The search and
retrieval model that will be described is an adaptation of the vector space model,
described in Salton, G., and McGill, M.J., Introduction to Modern Information
Retrieval (1983), the disclosure of which is incorporated by this reference.

Turning to FIGURE 3, the retrieval model represents both documents and
queries, step 310, as lists of weighted words and phrases. Relevant documents are
retrieved by comparing a query composed of keyword terms of one or more types to
document profiles stored in a database. At step 312, the candidate list of documents
against which the query is tested is made up of those documents in which any of the
search terms are found. (If the NOT modifier is used, the terms it modifies are not
included in making the initial selection.)
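
By way of illustration only, the candidate selection of step 312 might be sketched
in Python as follows. The inverted-index layout, the function name, and the
convention of marking a NOT term with a negative weight are assumptions of the sketch.

```python
def candidate_documents(index, query_weights):
    """Return ids of documents containing at least one non-NOT query term.

    index         : dict mapping term -> set of document ids (an inverted index)
    query_weights : dict mapping term -> numeric weight; a negative weight
                    stands for the NOT modifier (an assumption of this sketch)
    """
    candidates = set()
    for term, weight in query_weights.items():
        if weight < 0:
            continue  # NOT terms do not contribute to the initial selection
        candidates |= index.get(term, set())
    return candidates
```

A document that contains only a NOT term is therefore never selected, whereas a
document that contains both a wanted term and a NOT term remains a candidate.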

At step 314, the similarity between a document and query is calculated with a metric
based on the cosine formula, which is described in Salton, G., Automatic Text
Processing, Addison-Wesley, 1989. The cosine formula measures the extent of a
match between the document and query and produces a similarity score with respect
to a document $j$, taking a normalized inner product between a query vector and a
document vector, as follows,

$$ \mathit{sim}_j = \frac{\sum_{i=1}^{R} q_i \, d_{ij}}{\sqrt{\sum_{i=1}^{R} q_i^2} \times \sqrt{\sum_{i=1}^{R} d_{ij}^2}} \tag{1} $$

where

$\mathit{sim}_j$ = the calculated similarity of a query to a particular document $j$
$R$ = the number of terms in the database
$q_i$ = the user assigned weight for term $i$
$d_{ij}$ = the weight of term $i$ in document $j$ (this is a vector in $i$ only)

and where

$$ d_{ij} = f_{ij} \times \log \frac{N}{n_i} \tag{2} $$

where

$f_{ij}$ = the number of times term $i$ appears in document $j$
$N$ = the total number of documents in the collection
$n_i$ = the number of documents containing term $i$
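
By way of illustration only, formulas (1) and (2) might be computed in Python as
follows, with vectors held as sparse dictionaries. The function names and the use
of the natural logarithm are assumptions of the sketch; the text does not state
the base of the logarithm.

```python
import math

def tfidf_weights(term_counts, doc_freq, n_docs):
    """Formula (2): d_ij = f_ij * log(N / n_i) for the terms of one document j."""
    return {t: f * math.log(n_docs / doc_freq[t])
            for t, f in term_counts.items()}

def cosine_similarity(query, doc):
    """Formula (1): normalized inner product of two sparse vectors."""
    dot = sum(q * doc.get(t, 0.0) for t, q in query.items())
    norm_q = math.sqrt(sum(q * q for q in query.values()))
    norm_d = math.sqrt(sum(d * d for d in doc.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_q * norm_d)
```

Terms absent from a document contribute zero to the inner product, so only the
terms shared by the query and the document need to be visited.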

Formula (1) can be expressed more succinctly using conventional vector notation,
where

$Q$ = the vector of values $q_i$
$D_j$ = the vector of values $d_{ij}$

by which the similarity measure set forth above can be expressed (step 322) as

$$ \mathit{sim}_j = \frac{Q \cdot D_j}{|Q| \times |D_j|} \tag{3} $$



The vector space method and cosine formula were developed to measure the
similarity of one document to other documents in a collection. The values $q_i$, in that
context, were proportional to the number of times the i-th term appears in the
document that serves as a query. In the present method, the values $q_i$ have a default
value of 1, but are user selectable to other values. User interface categories of
High, Medium, Low, and Not simplify the user's task in selecting these values and
are translated into the numeric values used in the similarity calculation. The Not
category actually represents a negative number, causing documents that contain
the term to have a lower score than they would otherwise. Thus, the retrieval model
allows the user to identify the importance of a term.
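
By way of illustration only, the translation of the user interface categories into
query weights might be sketched in Python as follows. The numeric values chosen for
High, Medium, Low, and Not are hypothetical; the text states only that the default
weight is 1 and that Not corresponds to a negative number.

```python
import math

# Hypothetical numeric values (assumptions of this sketch).
CATEGORY_WEIGHTS = {"High": 3.0, "Medium": 2.0, "Low": 1.0, "Not": -1.0}
DEFAULT_WEIGHT = 1.0

def query_vector(terms):
    """Build {term: q_i} from (term, category-or-None) pairs."""
    return {term: CATEGORY_WEIGHTS.get(category, DEFAULT_WEIGHT)
            for term, category in terms}

def score(query, doc):
    """Cosine similarity between sparse query and document vectors."""
    dot = sum(q * doc.get(t, 0.0) for t, q in query.items())
    norm = (math.sqrt(sum(q * q for q in query.values()))
            * math.sqrt(sum(d * d for d in doc.values())))
    return dot / norm if norm else 0.0
```

Under these assumed values, a document containing a term marked Not receives a
lower score than an otherwise identical document that lacks the term, because the
negative weight reduces the inner product.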

The retrieval model differs from the basic cosine formula (1), above, in other
respects as well. The document weight factor $d_{ij}$ is normally calculated in the retrieval
model using the square root of the frequency factor, as follows, rather than formula
(2), above.

$$ d_{ij} = \sqrt{f_{ij}} \times \log \frac{N}{n_i} \tag{4} $$

Also, at optional steps 320 and 324, the model provides for a quadrant-based
weighting factor $k_h$. (The subscript $h$ selects one of the four types, just as the
subscript $i$ selected one of the terms in the database.) Thus, with quadrant-based
weighting the similarity is measured by the type of the query term (that is, in the
four-quadrant profile set forth above, by whether the query term is one of concept,
case, statute, or fact), and the similarity for each type is weighted by the factor $k_h$.
This is expressed (steps 322 and 324) in the following formula,

$$ \mathit{sim}_j = \sum_{h} k_h \, \frac{Q_h \cdot D_{jh}}{|Q_h| \times |D_{jh}|} \tag{5} $$

where the subscript $h$ indicates that the vectors $Q$ and $D_j$ are limited to the terms
in the respective quadrants.
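
By way of illustration only, the square-root document weight and the
quadrant-weighted similarity might be sketched in Python as follows. The sketch
assumes that formula (4) replaces the raw frequency of formula (2) with its square
root and that formula (5) sums a per-quadrant cosine score weighted by a
per-quadrant factor. The quadrant names, the data layout, and the default factor
of 1 are assumptions.

```python
import math

QUADRANTS = ("concept", "case", "statute", "fact")

def sqrt_weight(f_ij, n_docs, n_i):
    """Square-root-damped term frequency times log(N / n_i)."""
    return math.sqrt(f_ij) * math.log(n_docs / n_i)

def _cosine(q, d):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0

def quadrant_similarity(query, doc, k):
    """Sum over quadrants h of k_h * cosine(Q_h, D_jh).

    query, doc : dict mapping quadrant name -> {term: weight}
    k          : dict mapping quadrant name -> weighting factor k_h
    """
    return sum(k.get(h, 1.0) * _cosine(query.get(h, {}), doc.get(h, {}))
               for h in QUADRANTS)
```

Because each quadrant is normalized separately in this form, a perfect match in one
quadrant contributes exactly its factor to the total, regardless of how many terms
the other quadrants contain.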

Alternatively, the normalization is done on the document as a whole, thus,

$$ \mathit{sim}_j = \sum_{h} k_h \, \frac{Q_h \cdot D_{jh}}{|Q_h| \times |D_j|} $$



(This is an example of normalization step 326.) Also, the retrieval model provides
for the weighting of the similarity score by a user-selectable factor, step 328. The
alternatives are (a) the factor 1, (b) the number of quadrants in which a match was
found, (c) the square root of the number of quadrants in which a match was found,
and (d) the ratio of the number of terms in the query found in document j divided
by the total number of terms in the query. Alternative (d) scales the similarity score
to the ratio of the total number of query terms found to the total number of query
terms. Use of this alternative (d) is the default configuration. This emphasizes
documents that have many matching terms over documents that have one term
matching many times. Alternative (d) may be used in combination with either
alternative (b) or (c).
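
By way of illustration only, the user-selectable factors (a) through (d) might be
applied in Python as follows. The function name, the parameter names, and the
string values used to select an alternative are assumptions of the sketch.

```python
import math

def scale_score(score, matched_quadrants, matched_terms, query_terms,
                quadrant_mode="none", use_coverage=True):
    """Apply the user-selectable factors (a)-(d) to a similarity score.

    quadrant_mode : "none" -> factor 1 (a); "count" -> (b); "sqrt" -> (c)
    use_coverage  : apply (d), the fraction of query terms found in the document
    """
    if quadrant_mode == "count":
        score *= matched_quadrants
    elif quadrant_mode == "sqrt":
        score *= math.sqrt(matched_quadrants)
    elif quadrant_mode != "none":
        raise ValueError(f"unknown quadrant_mode: {quadrant_mode!r}")
    if use_coverage and query_terms:
        score *= matched_terms / query_terms
    return score
```

With the defaults shown, only alternative (d) is applied, so a document matching
three of four query terms keeps three quarters of its similarity score.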

The model also allows the user to scale the similarity score on a quadrant (type)
basis with a factor that is the number of terms of the type (i.e., in the quadrant) in
the query found in document j divided by the total number of terms of the type in
the query.

As can be seen, the retrieval model allows the user to weigh document search terms
to reduce the importance of common words in a search while maintaining the
importance of multiple hits within a document. By using the length of the query and
the length of the document (measured by the number of terms that occur in each)
to normalize the score, queries or documents that are particularly verbose do not
tend to swamp the results.

The present invention has been described in terms of specific embodiments. The
invention, however, is not limited to these specific embodiments. Rather, the scope
of the invention is defined by the following claims, and other embodiments are within
the scope of the claims. For example, the metric for ranking need not be the cosine
formula, or a formula derived from the cosine formula. Other metrics, including a
probabilistic metric, such as, for example, a Bayesian belief network, can also be
used.

CLAIMS

What is claimed is: 

1. A method for matching a current case reference to a set of case references,
comprising:

    (a) maintaining a database of the case references that have been processed by
        the method;

    (b) parsing the current reference for its citations and, for each citation (the
        current citation) in the current reference, parsing the current citation into
        its volume, reporter, and page;

    (c) searching the database of the case references for a matching case by
        applying a set of tests to the cases in the database of the case references,
        the set of tests including:

        (A) if a candidate case matches the current case in two
            volume-reporter-page citations from different reporters, the candidate
            case is a match.

2. The method of claim 1, further comprising:

    (d) parsing the current reference for its party names and, for each non-noise
        word in each party name in the current reference, acquiring a sound-alike
        value for the non-noise word;

    (e) searching the database of the case references for a matching case by
        applying a set of tests to the cases in the database of the case references,
        the set of tests including:

        (A) if a candidate case matches the current case in one citation, and
            the cases' names match at least loosely, and neither case has a
            citation that is inconsistent with the citations of the other case,
            the candidate case is a match.

3. The method of claim 2, further comprising:

    (f) initializing a candidate set to be empty;

    (g) adding to the candidate set a case reference from the database if it has a
        same party name as the current reference, by sound-alike values;

    (h) adding to the candidate set a case reference from the database if the case
        reference has a volume-reporter-page citation matching the current
        citation;

    (i) searching the candidate set rather than the entire database of case
        references for a matching case by applying a set of tests to the cases in the
        candidate set.



4. A method for matching a current case reference to a set of case references,
comprising:

    (a) maintaining a database of the case references that have been processed by
        the method;

    (b) initializing a candidate set to be empty;

    (c) parsing the current reference for its citations and, for each citation (the
        current citation) in the current reference,

        (1) parsing the current citation into its volume, reporter, page, court, and
            year, and then

        (2) adding to the candidate list a case reference from the database if the
            case reference has a volume-reporter-page citation matching the
            current citation;

    (d) searching the candidate list for a matching case by applying a set of tests
        to the cases in the candidate list, the set of tests including a test selected
        from the group consisting of:

        (A) if a candidate case matches the current case in two
            volume-reporter-page citations from different reporters, and the court
            information for the two cases is not inconsistent, the candidate
            case is a match,

        (B) if a candidate case matches the current case in two
            volume-reporter-page citations from different reporters, and the year
            information is not inconsistent, the candidate case is a match,
            and

        (C) if a candidate case matches the current case in one citation, and
            both the court information and the year information for the two
            cases is not inconsistent, and neither case has a citation that is
            inconsistent with the citations of the other case, the candidate
            case is a match.

5. The method of claim 4, further comprising:

    (e) parsing the current reference for its party names and, for each non-noise
        word in each party name in the current reference, acquiring a sound-alike
        value for the non-noise word;

    (f) adding to the candidate set a case reference from the database if it has a
        same party name as the current reference, by sound-alike values;

    (g) searching the candidate set for a matching case by applying a set of tests
        to the cases in the candidate set, the set of tests including a test selected
        from the group consisting of:

        (A) if a candidate case matches the current case in one citation, and
            the cases' names match at least loosely, and neither case has a
            citation that is inconsistent with the citations of the other case,
            the candidate case is a match;

        (B) if a candidate case matches the current case in one citation, and
            the courts and the years also match, the cases' names match very
            tightly, and the cases have less than two citations that are
            inconsistent from one case to the other, the candidate case is a
            match; and

        (C) if a candidate case matches the current case in both courts and
            the years, the cases' names match very tightly, and neither case
            has a citation that is inconsistent with the citations of the other
            case, the candidate case is a match.



6. A method for ranking the relevance of a target document found by a search
query in a set of documents, where the query has search terms, the method
comprising:

    (a) providing a set of weighting factors defined by a user for the search query,
        at least one of which weighting factors differing from the other weighting
        factors in the set of weighting factors and at least one weighting factor
        having a negative value, whereby each search term has an associated
        weighting factor;

    (b) applying a metric function to the search query, the weighting factors, and
        the target document to produce a similarity measure for the target
        document against the search query weighted by the weighting factors; and

    (c) ranking the target document by its similarity measure.

7. The method of claim 6 wherein the metric function includes a calculation of an
inner product between a search query and the documents in the set over a
document term vector space.

8. A method for ranking the relevance of a target document found by a search
query in a set of documents, where the documents in the set of documents have
document terms, each document term being of one of a plurality of types, and
where the search query has search terms, each search term being of one of the
plurality of types, the method comprising:

    (a) for each type in the plurality of types, applying a metric function to those
        terms of the search query and the target document having the type, to
        produce a type-based similarity measure for the target document against
        the search query for each type in the plurality of types;

    (b) combining the type-based similarity measures by applying a user-selectable
        type-based weight to each of the target document's type-based similarity
        measures to produce a final similarity measure; and

    (c) ranking the target document by the final similarity measure.

9. A method for ranking the relevance of a target document found by a search
query in a set of documents, where the documents in the set of documents have
document terms, each document term being of one of a plurality of types, and
where the search query has search terms, each search term being of one of the
plurality of types, the method comprising:

    (a) for each type in the plurality of types, applying a metric function to those
        terms of the search query and the target document having the type, to
        produce a type-based similarity measure for the target document against
        the search query for each type in the plurality of types;

    (b) combining the type-based similarity measures and applying a factor to the
        similarity measures based on the number of different types for which some
        search term matched a target document to produce a final similarity
        measure; and

    (c) ranking the target document by the final similarity measure.



10. The method of claim 9 further comprising:

    providing a set of weighting factors for the search query, at least one of which
    weighting factors differs from the other weighting factors in the set of
    weighting factors, whereby each search term has an associated weighting
    factor; and wherein

    the step of applying a metric function further comprises applying a metric
    function to the search query, the weighting factors, and the target
    document to produce a similarity measure for the target document against
    the search query weighted by the weighting factors.



11. The method of claim 10 wherein the metric function includes a calculation of
an inner product between a search query and the documents in the set over a
document term vector space.

12. The method of claim 10 wherein the step of providing a set of weighting factors
comprises providing a set of weighting factors defined by a user for the search
query.

13. The method of claim 12 wherein the step of providing a set of weighting factors
comprises providing at least one weighting factor having a negative value.